Add HealthCheckRuntime context manager for shared boilerplate [1/2] by gustcol · Pull Request #77 · facebookresearch/gcm

gustcol · 2026-03-01T11:41:30Z

Summary

Ref: #75

Introduce HealthCheckRuntime, a @dataclass context manager that encapsulates the ~30 lines of repeated setup code every health check subcommand duplicates: logger initialization, GPU node ID detection, derived cluster resolution, ExitStack with TelemetryContext + OutputContext, and killswitch checking
Reduces per-subcommand boilerplate from ~30 lines to ~5 lines (a with HealthCheckRuntime(...) as rt: block)
Purely additive — existing checks continue to work unchanged; no migration in this PR

Stacked PR series: [1/2] Runtime helper → [2/2] Scaffold tool (depends on this PR)

Before (~30 lines per subcommand)

node = socket.gethostname()
logger, _ = init_logger(...)
try: gpu_node_id = gni_lib.get_gpu_node_id()
except: gpu_node_id = None; ...
derived_cluster = get_derived_cluster(...)
exit_code = ExitCode.UNKNOWN
msg = ""
with ExitStack() as s:
    s.enter_context(TelemetryContext(...))
    s.enter_context(OutputContext(...))
    ff = FeatureValueHealthChecksFeatures()
    if ff.get_healthchecksfeatures_disable_check_sensors():
        exit_code = ExitCode.OK; ...
    # actual check logic
    sys.exit(exit_code.value)

After (~5 lines per subcommand)

with HealthCheckRuntime(
    cluster=cluster, type=type, log_level=log_level,
    log_folder=log_folder, sink=sink, sink_opts=sink_opts,
    verbose_out=verbose_out,
    heterogeneous_cluster_v1=heterogeneous_cluster_v1,
    health_check_name=HealthCheckName.CHECK_SENSORS,
    killswitch_getter=lambda: FeatureValueHealthChecksFeatures()
        .get_healthchecksfeatures_disable_check_sensors(),
) as rt:
    # actual check logic using rt.logger, rt.node, etc.
    rt.finish(exit_code, msg)

Test plan

nox -s tests -- gcm/tests/health_checks_tests/test_runtime.py — 6 tests covering initialization, killswitch behavior, finish(), context nesting, GPU node ID failure
nox -s lint
nox -s format
nox -s typecheck

…plate Extract the ~30 lines of repeated setup code (logger init, GPU node ID detection, derived cluster resolution, TelemetryContext + OutputContext nesting, killswitch check) into a reusable HealthCheckRuntime dataclass context manager. This reduces per-subcommand boilerplate from ~30 lines to ~5 lines. The helper is purely additive — existing checks continue to work unchanged. New checks can use `with HealthCheckRuntime(...) as rt:` instead of manually wiring up the setup ceremony. Includes comprehensive tests covering field initialization, killswitch behavior, context manager nesting, GPU node ID failure handling, and the finish() convenience method. Refs: facebookresearch#75

github-actions · 2026-03-01T11:41:40Z

CI Commands

The following CI workflows run automatically on every push and pull request:

Workflow	What it runs
GPU Cluster Monitoring Python CI	lint, tests, typecheck, format, deb build, pyoxidizer builds
Go packages CI	shelper tests, format, lint

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

Command	Description	Requires approval?
`/metaci tests`	Runs Meta internal integration tests (pytest)	Yes — a maintainer must trigger the command and approve the deployment request
`/metaci integration tests`	Same as above (alias)	Yes

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

Apply ufmt formatting and fix mypy errors in test helper by using explicit typed parameters instead of **kwargs dict unpacking.

luccabb · 2026-03-12T01:15:50Z

@claude

claude · 2026-03-12T01:16:08Z

Claude finished @luccabb's task in 2m 23s —— View job

Code Review

Gather context and read CLAUDE.md
Read changed files (runtime.py, test_runtime.py)
Review code and provide feedback

Overall this is a well-structured abstraction that genuinely reduces boilerplate. The dataclass + context manager pattern is clean, test coverage is solid, and the finish() method works correctly (calling sys.exit inside the with body means __exit__ is properly invoked). I found one correctness bug and a few smaller issues below.

Bug: `sys.exit()` inside `enter` silently skips telemetry and output

runtime.py lines 97–103

if self.killswitch_getter():
    self.exit_code = ExitCode.OK
    self.logger.info(...)
    sys.exit(0)   # ← raises SystemExit from inside __enter__

When sys.exit(0) is called in __enter__, Python's with statement does not call __exit__ — __exit__ is only invoked when __enter__ returns successfully. This means self._stack.__exit__ is never called, so TelemetryContext.__exit__ (which writes telemetry records) and OutputContext.__exit__ (which prints the check result) are silently skipped.

This is a behavioral regression from the original code in check_sensors.py, where the killswitch check lives inside the with ExitStack() block — meaning SystemExit propagates through the with block body, the ExitStack is properly torn down, and both contexts get their __exit__ called.

The simplest fix is to drain the stack before exiting:

if self.killswitch_getter():
    self.exit_code = ExitCode.OK
    self.logger.info(
        "%s is disabled by killswitch", self.health_check_name.value
    )
    self._stack.__exit__(None, None, None)  # flush telemetry + output
    sys.exit(0)

Both TelemetryContext.__exit__ and OutputContext.__exit__ handle (None, None, None) correctly (they read from get_exit_code_msg() which captures self.exit_code, already set to ExitCode.OK above). Fix this →

Missing test coverage for the above bug

test_runtime.py lines 66–83 (test_killswitch_enabled_exits_ok)

The test verifies SystemExit(0) is raised but does not assert that TelemetryContext.__exit__ and OutputContext.__exit__ were called. Adding assertions for both mock_telem_instance.__exit__ and mock_output_instance.__exit__ (similar to test_telemetry_and_output_contexts_entered) would have caught the bug above. Fix this →

Minor: direct `enter`/`exit` calls on `_stack` warrant a comment

runtime.py lines 72–73

self._stack = ExitStack()
self._stack.__enter__()

Calling dunder methods directly is unusual and can confuse readers. A brief comment like # Use ExitStack manually so it spans both __enter__ and __exit__ would make the intent clear.

Nit: missing docstrings on public class and `finish()`

HealthCheckRuntime and finish() have no docstrings. Given that this is intended as a shared utility that other check authors will use, a short class-level docstring and a finish() docstring (noting that it calls sys.exit and is therefore NoReturn) would help discoverability.

Nit: mixed logging style

runtime.py line 63 uses an f-string:

self.logger.warning(f"Could not get gpu_node_id, likely not a GPU host: {e}")

while the surrounding logger.info calls at lines 52–57 use %s formatting. The existing check_sensors.py uses the same f-string for this specific warning, so it's not wrong — just inconsistent within the file.

Summary

Severity	Finding
🔴 Bug	`sys.exit()` in `__enter__` bypasses `TelemetryContext`/`OutputContext` cleanup on killswitch
🟡 Test gap	`test_killswitch_enabled_exits_ok` doesn't verify context `__exit__` is called
🟢 Minor	Direct `_stack.__enter__()` call needs a comment
🟢 Nit	Missing docstrings on class and `finish()`
🟢 Nit	Mixed `%s` vs f-string logging style

gustcol requested review from A-Kokolis, calebho, giongto35, jj10306 and luccabb as code owners March 1, 2026 11:41

meta-cla bot added the cla signed label Mar 1, 2026

This was referenced Mar 1, 2026

Add scaffold tool for creating new health checks [2/2] #78

Open

[rfc] Simplifying the New Health Check #75

Open

Fix formatting (ufmt) and type annotations (mypy)

a076093

Apply ufmt formatting and fix mypy errors in test helper by using explicit typed parameters instead of **kwargs dict unpacking.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add HealthCheckRuntime context manager for shared boilerplate [1/2]#77

Add HealthCheckRuntime context manager for shared boilerplate [1/2]#77
gustcol wants to merge 2 commits intofacebookresearch:mainfrom
gustcol:feature/health-check-runtime

gustcol commented Mar 1, 2026

Uh oh!

github-actions bot commented Mar 1, 2026

Uh oh!

luccabb commented Mar 12, 2026

Uh oh!

claude bot commented Mar 12, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

gustcol commented Mar 1, 2026

Summary

Before (~30 lines per subcommand)

After (~5 lines per subcommand)

Test plan

Uh oh!

github-actions bot commented Mar 1, 2026

CI Commands

Uh oh!

luccabb commented Mar 12, 2026

Uh oh!

claude bot commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

Bug: sys.exit() inside __enter__ silently skips telemetry and output

Missing test coverage for the above bug

Minor: direct __enter__/__exit__ calls on _stack warrant a comment

Nit: missing docstrings on public class and finish()

Nit: mixed logging style

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude bot commented Mar 12, 2026 •

edited

Loading

Bug: `sys.exit()` inside `enter` silently skips telemetry and output

Minor: direct `enter`/`exit` calls on `_stack` warrant a comment

Nit: missing docstrings on public class and `finish()`